Power consumption and power density for the Translation Lookaside Buffer (TLB) are 
important considerations not only in its design, but can have a consequence on cache 
design as well. This paper embarks on a new philosophy for reducing the number of 
accesses to the instruction TLB (iTLB) for power and performance optimizations. The 
overall idea is to keep a translation currently being used in a register and avoid going to 
the iTLB as far as possible, until there is a page change. We propose f ... 
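
The idea summarized in this abstract can be illustrated with a small model. This is only a sketch under stated assumptions, not the paper's design: the 4 KB page size, the dictionary standing in for the iTLB, and the names `CurrentFrameRegister`, `fetch`, and `itlb_accesses` are all hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages


class CurrentFrameRegister:
    """Hypothetical model: hold the translation of the page being fetched from."""

    def __init__(self, itlb):
        self.itlb = itlb          # stand-in iTLB: virtual page number -> physical frame number
        self.vpn = None           # virtual page currently held in the register
        self.pfn = None           # its physical frame
        self.itlb_accesses = 0    # how often the iTLB had to be consulted

    def fetch(self, vaddr):
        """Translate an instruction address, touching the iTLB only on a page change."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn != self.vpn:
            self.itlb_accesses += 1
            self.pfn = self.itlb[vpn]
            self.vpn = vpn
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (self.pfn << PAGE_SHIFT) | offset
```

In this model, a run of sequential fetches within one page costs a single iTLB access; only the fetch that crosses into a new page triggers another lookup.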

22 Poster Session 3: TLB and snoop energy-reduction using virtual caches in low-power 
chip-multiprocessors 

Magnus Ekman, Per Stenstrom, Fredrik Dahlgren 

August 2002 Proceedings of the 2002 international symposium on Low power 
electronics and design 

Publisher: ACM Press 
Full text available: pdf(84.59 KB) 



In our quest to bring down the power consumption in low-power chip-multiprocessors, we 
have found that TLB and snoop accesses account for about 40% of the energy wasted by 
all L1 data-cache accesses. We have investigated the prospects of using virtual caches to 
bring down the number of TLB accesses. A key observation is that while the energy 
wasted in the TLBs is cut, the energy associated with snoop accesses becomes higher. 
We then contribute with two techniques to reduce the number of snoop ... 

Modern processors have various features for latency tolerance such as Hit-under-miss, 
Out-of-order execution, or Multi-threading. However, many processors must make a 
precise trap for a TLB miss, because they maintain the TLB with software and cannot 
distinguish the TLB scarcity from the page fault. It is very important for the application 
and/or the operating system to avoid the TLB misses as much as possible. Many 
processors have some super page features that extend the coverage of the TLB sig ... 

24 Energy-aware compiling and scheduling: Compiler-directed code restructuring for 
reducing data TLB energy 
M. Kandemir, I. Kadayif, G. Chen 

September 2004 Proceedings of the 2nd IEEE/ACM/IFIP international conference on 
Hardware/software codesign and system synthesis 
Publisher: ACM Press 

Full text available: pdf(172.79 KB) Additional Information: full citation, abstract, references, index terms 

Prior work on TLB power optimization considered circuit and architectural techniques. A 
recent software-based technique for data TLBs has considered the possibility of storing the 
frequently used virtual-to-physical address translations in a set of translation registers 
(TRs), and using them when necessary instead of going to the data TLB. This paper 
presents a compiler-based strategy for increasing the effectiveness of TRs. The idea is to 
restructure the application code in such a fashion that ... 

Keywords: code restructuring 

Naila Rahman, Rajeev Raman 

December 2001 Journal of Experimental Algorithmics (JEA), volume 6 
Publisher: ACM Press 

Full text available: pdf(446.81 KB), ps(360.14 KB), tar(706.56 KB) Additional Information: full citation, abstract, references, citings, index terms 

We demonstrate the importance of reducing misses in the translation-lookaside buffer 
(TLB) for obtaining good performance on modern computer architectures. We focus on 
least-significant-bit first (LSB) radix sort, standard implementations of which make many 
TLB misses. We give three techniques which simultaneously reduce cache and TLB misses 
for LSB radix sort: reducing working set size, explicit block transfer and pre-sorting. We 
note that: • All the techniques above yield algorithms whose ... 

Keywords: cache, efficient sorting algorithms, external-memory algorithms, locality of 
reference, memory hierarchy, radix sort, translation-lookaside buffer (TLB) 
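
For orientation, a plain LSB radix sort can be sketched as follows. This is the baseline kind of implementation the abstract describes as making many TLB misses (each pass scatters keys across as many output regions as there are digit values), not any of the three improved techniques. The 8-bit radix, the 32-bit non-negative integer keys, and the name `lsb_radix_sort` are assumptions made for illustration.

```python
def lsb_radix_sort(keys, key_bits=32, radix_bits=8):
    """Sort non-negative integers below 2**key_bits, least significant digit first."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        # Count how many keys fall into each digit bucket for this pass.
        counts = [0] * (mask + 1)
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Prefix sums give the first output position of each bucket.
        starts = [0] * (mask + 1)
        total = 0
        for d in range(mask + 1):
            starts[d] = total
            total += counts[d]
        # Stable scatter: consecutive writes may land in up to 2**radix_bits
        # distant regions of the output array.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[starts[d]] = k
            starts[d] += 1
        keys = out
    return keys
```

The scatter loop is the part with poor locality: with a large radix, the active output positions can span more pages than the TLB can map at once.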

D. L. Black, R. F. Rashid, D. B. Golub, C. R. Hill 

April 1989 ACM SIGARCH Computer Architecture News, Proceedings of the third 
international conference on Architectural support for programming 
languages and operating systems ASPLOS-III, volume 17 issue 2 
Publisher: ACM Press 

Full text available: pdf(1.38 MB) Additional Information: full citation, abstract, references, citings, index terms 

We discuss the translation lookaside buffer (TLB) consistency problem for multiprocessors, 
and introduce the Mach shootdown algorithm for maintaining TLB consistency in software. 
This algorithm has been implemented on several multiprocessors, and is in regular 
production use. Performance evaluations establish the basic costs of the algorithm and 
show that it has minimal impact on application performance. As a result, TLB consistency 
does not pose an insurmountable obstacle to multiprocessor ... 

June 2001 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the 
2001 ACM SIGMETRICS international conference on Measurement and 
modeling of computer systems SIGMETRICS '01, volume 29 issue 1 
Publisher: ACM Press 

Full text available: pdf(1.55 MB) Additional Information: full citation, abstract, references, citings 

This paper studies the memory behavior of important Java workloads used in 
benchmarking Java Virtual Machines (JVMs), based on instrumentation of both application 
and library code in a state-of-the-art JVM, and provides structured information about these 
workloads to help guide systems' design. We begin by characterizing the inherent memory 
behavior of the benchmarks, such as information on the breakup of heap accesses among 
different categories and on the hotness of references to fields and met ... 

28 Options for dynamic address translation in COMAs 
Xiaogang Qiu, Michel Dubois 

April 1998 ACM SIGARCH Computer Architecture News, Proceedings of the 25th 
annual international symposium on Computer architecture ISCA '98, volume 26 issue 3 

Publisher: IEEE Computer Society, ACM Press 

Additional Information: full citation, abstract, references, citings, index terms 

In modern processors, the dynamic translation of virtual addresses to support virtual 
memory is done before or in parallel with the first-level cache access. As processor 
technology improves at a rapid pace and the working sets of new applications grow 
insatiably, the latency and bandwidth demands on the TLB (Translation Lookaside Buffer) 
are getting more and more difficult to meet. The situation is worse in multiprocessor 
systems, which run larger applications and are plagued by the TLB consiste ... 

29 High-bandwidth address translation for multiple-issue processors 
Todd M. Austin, Gurindar S. Sohi 

May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd 
annual international symposium on Computer architecture ISCA '96, volume 24 issue 2 
Publisher: ACM Press 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms 

In an effort to push the envelope of system performance, microprocessor designs are 
continually exploiting higher levels of instruction-level parallelism, resulting in increasing 
bandwidth demands on the address translation mechanism. Most current microprocessor 
designs meet this demand with a multi-ported TLB. While this design provides an excellent 
hit rate at each port, its access latency and area grow very quickly as the number of ports 
is increased. As bandwidth demands continue to increase ... 

30 Modifying VM hardware to reduce address pin requirements 
Matthew Farrens, Arvin Park, Gary Tyson 

December 1992 ACM SIGMICRO Newsletter, Proceedings of the 25th annual 
international symposium on Microarchitecture MICRO 25, volume 23 issue 1-2 

Publisher: IEEE Computer Society Press, ACM Press 

Madhusudhan Talluri, Shing Kong, Mark D. Hill, David A. Patterson 
April 1992 ACM SIGARCH Computer Architecture News, Proceedings of the 19th 
annual international symposium on Computer architecture ISCA '92, volume 20 issue 2 
Publisher: ACM Press 

Full text available: pdf(1.18 MB) Additional Information: full citation, abstract, references, citings, index terms 

As computer system main memories get larger and processor cycles-per-instruction (CPIs) 
get smaller, the time spent in handling translation lookaside buffer (TLB) misses could 
become a performance bottleneck. We explore relieving this bottleneck by (a) increasing 
the page size and (b) supporting two page sizes. We discuss how to build a TLB to support 
two page sizes and examine both alternatives experimentally with a dozen 
uniprogrammed, user-mode traces for the SPARC architectur ... 

32 Architectural support for translation table management in large address space 
machines 

Jerry Huck, Jim Hays 

May 1993 ACM SIGARCH Computer Architecture News, Proceedings of the 20th 
annual international symposium on Computer architecture ISCA '93, volume 21 issue 2 
Publisher: ACM Press 

Full text available: pdf(1.34 MB) Additional Information: full citation, abstract, references, citings, index terms 

Virtual memory page translation tables provide mappings from virtual to physical 
addresses. When the hardware controlled Translation Lookaside Buffers (TLBs) do not 
contain a translation, these tables provide the translation. Approaches to the structure and 
management of these tables vary from full hardware implementations to complete 
software based algorithms. The size of the virtual address space used by processes is 
rapidly growing beyond 32 bits of address. As the utilized ad ... 

33 Design tradeoffs for software-managed TLBs 
David Nagle, Richard Uhlig, Tim Stanley, Stuart Sechrest, Trevor Mudge, Richard Brown 
May 1993 ACM SIGARCH Computer Architecture News, Proceedings of the 20th 
annual international symposium on Computer architecture ISCA '93, volume 21 issue 2 
Publisher: ACM Press 

Full text available: pdf(1.14 MB) Additional Information: full citation, abstract, references, citings, index terms 

An increasing number of architectures provide virtual memory support through software- 
managed TLBs. However, software management can impose considerable penalties, which 
are highly dependent on the operating system's structure and its use of virtual memory. 
This work explores software-managed TLB design tradeoffs and their interaction with a 
range of operating systems including monolithic and microkernel designs. Through 
hardware monitoring and simulations, we explore TLB performance for be ... 

34 A new page table for 64-bit address spaces 

M. Talluri, M. D. Hill, Y. A. Khalidi 

December 1995 ACM SIGOPS Operating Systems Review, Proceedings of the fifteenth 
ACM symposium on Operating systems principles SOSP '95, volume 29 issue 5 

Publisher: ACM Press 

D. Nagle, R. Uhlig, T. Mudge, S. Sechrest 
April 1994 ACM SIGARCH Computer Architecture News, Proceedings of the 21st 
annual international symposium on Computer architecture ISCA '94, volume 22 issue 2 
Publisher: IEEE Computer Society Press, ACM Press 

Full text available: pdf(1.27 MB) Additional Information: full citation, abstract, references, citings, index terms 

The allocation of die area to different processor components is a central issue in the design 
of single-chip microprocessors. Chip area is occupied by both core execution logic, such as 
ALU and FPU datapaths, and memory structures, such as caches, TLBs, and write buffers. 
This work focuses on the allocation of die area to memory structures through a 
cost/benefit analysis. The cost of memory structures with different sizes and associativities 
is estimated by using an established area model for on ... 

36 Optimizing database architecture for the new bottleneck: memory access 
Stefan Manegold, Peter A. Boncz, Martin L. Kersten 

December 2000 The VLDB Journal - The International Journal on Very Large Data 
Bases, volume 9 issue 3 
Publisher: Springer-Verlag New York, Inc. 

Full text available: pdf(357.33 KB) Additional Information: full citation, abstract, citings, index terms 

In the past decade, advances in the speed of commodity CPUs have far out-paced 
advances in memory latency. Main-memory access is therefore increasingly a performance 
bottleneck for many computer applications, including database systems. In this article, we 
use a simple scan test to show the severe impact of this bottleneck. The insights gained 
are translated into guidelines for database architecture, in terms of both data structures 
and algorithms. We discuss how vertically fragmented data struc ... 

Keywords: Decomposed storage model, Implementation techniques, Join algorithms, 
Main-memory databases, Memory access optimization, Query processing 

38 The use of multithreading for exception handling 
Craig B. Zilles, Joel S. Emer, Gurindar S. Sohi 

November 1999 Proceedings of the 32nd annual ACM/IEEE international symposium 
on Microarchitecture 
Publisher: IEEE Computer Society 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms 

Common hardware exceptions, when implemented by trapping, unnecessarily serialize 
program execution in dynamically scheduled superscalar processors. To avoid the 
consequences of trapping the main program thread, multithreaded CPUs can exploit 
control and data independence by executing the exception handler in a separate hardware 
context. The main thread doesn't squash instructions after the excepting instruction, 
conserving fetch bandwidth and allowing execution of instructions inde ... 

39 Design tradeoffs for software-managed TLBs 
Richard Uhlig, David Nagle, Tim Stanley, Trevor Mudge, Stuart Sechrest, Richard Brown 

August 1994 ACM Transactions on Computer Systems (TOCS), volume 12 issue 3 

Publisher: ACM Press 

An increasing number of architectures provide virtual memory support through software- 
managed TLBs. However, software management can impose considerable penalties that 
are highly dependent on the operating system's structure and its use of virtual memory. 
This work explores software-managed TLB design tradeoffs and their interaction with a 
range of monolithic and microkernel operating systems. Through hardware monitoring and 
simulation, we explore TLB performance for benchmarks running on a ... 

Keywords: hardware monitoring, translation lookaside buffer (TLB), trap-driven 
simulation 



40 Effect of node size on the performance of cache-conscious B+-trees 
Richard A. Hankins, Jignesh M. Patel 

June 2003 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the 
2003 ACM SIGMETRICS international conference on Measurement and 
modeling of computer systems SIGMETRICS '03, volume 31 issue 1 
Publisher: ACM Press 

Full text available: pdf(271.16 KB) Additional Information: full citation, abstract, references, index terms 

In main-memory databases, the number of processor cache misses has a critical impact on 
the performance of the system. Cache-conscious indices are designed to improve 
performance by reducing the number of processor cache misses that are incurred during a 
search operation. Conventional wisdom suggests that the index's node size should be 
equal to the cache line size in order to minimize the number of cache misses and improve 
performance. As we show in this paper, this design choice ignores additi ... 

Keywords: B+-tree, cache-conscious, index 